Delphi’s forecasting effort:
An Overview

Daniel J. McDonald
Department of Statistics
University of British Columbia




16 July 2021

This Talk

Outline:

  1. Background (context, teams, ensembling)
  2. Our process (currently)
  3. How have we done?
  4. Lessons learned and thoughts

Implemented with {evalcast} package + {covidcast} API

Reproducible talk: all code included

1 Background

The task

The goal

The other teams

2 Current practice

Producing the forecasts

Workflow

1. Sunday/Monday morning: make sure any known data issues are fixed

2. Flag and ‘correct’ other anomalies

# in the covid-19-forecast repo
zookeeper::make_state_corrector()

Apply the corrections randomly so that we don’t smooth the signal too much.
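The random-correction idea above can be sketched as follows: replace an anomalous one-day reporting dump with a typical value and redistribute the excess at random over preceding days, rather than smoothing uniformly. This is an illustrative Python sketch, not Delphi’s actual code (which lives in the R `zookeeper` package); the function and parameter names here are hypothetical.

```python
import numpy as np

def redistribute_spike(counts, spike_idx, window=30, rng=None):
    """Replace a one-day reporting dump with a typical value and
    spread the excess randomly over the preceding `window` days
    (multinomial draw, weighted by existing counts), preserving the
    total. Hypothetical sketch of the correction idea; not Delphi's API.
    """
    rng = np.random.default_rng(rng)
    counts = np.asarray(counts, dtype=float).copy()
    lo = max(0, spike_idx - window)
    # estimate a "typical" value for the spike day from its neighbors
    typical = np.median(counts[lo:spike_idx])
    excess = counts[spike_idx] - typical
    if excess <= 0:
        return counts
    # allocate the excess randomly, proportional to prior-day counts
    weights = counts[lo:spike_idx]
    total = weights.sum()
    if total > 0:
        p = weights / total
    else:
        p = np.full(spike_idx - lo, 1.0 / (spike_idx - lo))
    alloc = rng.multinomial(int(excess), p)
    counts[lo:spike_idx] += alloc
    # keep the overall total fixed
    counts[spike_idx] = typical + (excess - alloc.sum())
    return counts
```

Because the allocation is a random multinomial draw rather than a uniform spread, repeated corrections don’t systematically flatten the signal.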

Making corrections

Some data has other obvious issues …


Daily vs. averages
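The daily-vs.-averages comparison rests on a 7-day trailing average, which irons out the strong day-of-week reporting cycle in COVID data. A minimal Python sketch (the talk’s actual pipeline uses R tooling; the function name is illustrative):

```python
import numpy as np

def trailing_7day_average(daily):
    """7-day trailing average of a daily-reported series: each day is
    the mean of itself and the 6 preceding days. Early entries average
    over however many observations exist so far. Illustrative sketch.
    """
    daily = np.asarray(daily, dtype=float)
    n = len(daily)
    # causal trailing sums: entry i sums daily[max(0, i-6) .. i]
    sums = np.convolve(daily, np.ones(7), mode="full")[:n]
    denom = np.minimum(np.arange(1, n + 1), 7)
    return sums / denom
```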

Forecasters

3. Run the forecaster

States: 0-, 1-, and 2-week lags of deaths, and 0-, 1-, and 2-week lags of cases.

Counties:

## List of 3
##  $ cases             : num [1:10] 0 1 2 3 6 9 12 15 18 21
##  $ fb-smoothed-hh-cli: num [1:4] 3 10 17 24
##  $ dv-smoothed-cli   : num [1:4] 3 10 17 24
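The lag lists printed above (days of history used for each signal) translate into a lagged feature matrix for the forecaster. A hedged Python sketch of that construction (the actual forecasters are written in R; `lagged_features` is a hypothetical name):

```python
import numpy as np

def lagged_features(series, lags):
    """Build a feature matrix whose column j is the series shifted
    back by lags[j] days; rows without enough history are NaN.
    Mirrors lag lists like cases at 0,1,2,...,21 or survey signals
    at 3,10,17,24. Illustrative sketch only.
    """
    series = np.asarray(series, dtype=float)
    n = len(series)
    X = np.full((n, len(lags)), np.nan)
    for j, lag in enumerate(lags):
        if lag < n:
            X[lag:, j] = series[: n - lag]
    return X
```

Stacking such matrices across signals (cases, fb-smoothed-hh-cli, dv-smoothed-cli) with their respective lag sets yields the design matrix fed to the forecaster.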

Quality control

4. Look at the results

The system

5. The submission

Performance

3 How have we done?

ForecastHub - All teams with more than 100 forecasts

Performance over time (only teams that are better than us!)

Calibration (all time)

Calibration (since March 2021)

80% Coverage

Spatial performance

Our forecasts, NY

COVIDhub ensemble, NY

Our forecasts, Utah

COVIDhub ensemble, Utah

Performance

4 Lessons and thoughts

This is really hard

Important lessons:

Out-of-sample evaluation (with proper as-of data versioning) is huge.

Modular “forecaster template” is really helpful.

Nonstationarity is hard.

How hard is it?

On an equal footing, the best model beats the baseline by 20%. But give the baseline three extra weeks of data, and it beats the best model by 20%.

Why so hard?

Quick thoughts on future directions

Thanks


Delphi, Carnegie Mellon University